Abstract

In many problems in data mining and machine learning, data items that need to be clustered or
classified are not points in a high-dimensional space, but are distributions (points on a high-dimensional
simplex). For distributions, natural measures of distance are not the $\ell_p$ norms and variants, but
information-theoretic measures like the Kullback-Leibler distance, the Hellinger distance, and others.
Efficient estimation of these distances is a key component in algorithms for manipulating distributions.
Thus, sublinear resource constraints, either in time (property testing) or space (streaming), are crucial.

In this paper we design streaming and sublinear time property testing algorithms for entropy and
various information-theoretic distances. We start by resolving two open questions regarding property
testing of distributions. Firstly, we show a tight bound for estimating bounded, symmetric $f$-divergences
between distributions in a general property testing (sublinear time) framework (the so-called combined
oracle model). This yields optimal algorithms for estimating such well-known distances as the
Jensen-Shannon divergence and the Hellinger distance. Secondly, we close a $(\log n)/H$ gap between upper
and lower bounds for estimating entropy $H$ in this model. We provide an optimal algorithm over all
values of the entropy, and for small entropy the improvement is significant. In a stream setting (sublinear
space), we give the first algorithm for estimating the entropy of a distribution. Our algorithm runs in
polylogarithmic space and yields an asymptotic constant factor approximation scheme. An integral part
of the algorithm is an interesting use of $F_0$ (the number of distinct elements) estimation algorithms; we
also provide other results along the space/time/approximation tradeoff curve.

Our results have interesting structural implications that connect sublinear time- and space-constrained
algorithms. The mediating model is the random order streaming model, which assumes the input is a
random permutation of a multiset and was first considered by Munro and Paterson. We show that any
property testing algorithm in the combined oracle model for permutation invariant functions can be
simulated in the random order model in a single pass. This addresses a question raised by Feigenbaum et al
regarding the relationship between property testing and stream algorithms. Further, we give a
polylog-space PTAS for estimating the entropy of a one-pass random order stream. This bound cannot be
achieved in the combined oracle model.
1 Introduction



There are many settings where the natural unit of data, rather than being a point in a high dimensional vector 
space, is a distribution defined on n items. Examples include soft clustering [36], where the membership of 
a point in a cluster is described by a distribution, and anomaly detection [30], where the distance between 
two empirical distributions is used to detect anomalies. Typically, such settings involve large data sets, and 
so a natural requirement is that algorithms use small amounts of resources (space or time). 

In this paper, we examine sublinear algorithms for estimating properties of distributions. On the one 
hand we study the complexity of estimating information theoretic distances and measures on distributions, 
e.g., entropy, Jensen-Shannon divergence, Hellinger and Triangular distances, to name a few, and on the 
other, we explore the connections between various models in sublinear algorithms, e.g., property testing 
models, and data streams. We discuss both these aspects below. We will not be able to review the extensive
literature on either of these topics; however several good surveys, e.g., by Ron [35], Babcock et al [4] and
Muthukrishnan [34], exist.

1.1 Problems 

When dealing with distributions, distances arising from information-theoretic considerations are often more
natural than distances based on $\ell_p$ norms and the like. In the first half of the paper we focus on the
Ali-Silvey distances or $f$-divergences, discovered independently by Csiszár [18] and Ali and Silvey [1]. The
class of $f$-divergences includes many commonly used information-theoretic distances, e.g., the (asymmetric)
Kullback-Leibler (KL) divergence¹ and its symmetrization the Jensen-Shannon (JS) divergence, Matsusita's
divergence or the squared Hellinger distance, the (asymmetric) $\chi^2$ distance and its symmetrization,
the Triangle distance. In fact, for any convex function $f$ we can define the $f$-divergence
$D_f(q,p) = \sum_{x \in [n]} p(x) f(q(x)/p(x))$ if $f(1) = 0$ and $f$ is strictly convex at 1.²

Results of Csiszár [19], Liese and Vajda [31], Amari [3] and many others show that $f$-divergences are
the unique class of distances on distributions that arise from a fairly simple set of axioms, e.g.,
permutation invariance, non-decreasing projections, certain direct sum theorems, etc., in much the same way that
$\ell_2$ is a natural measure for points in $\mathbb{R}^n$. Moreover, all of these distances are related to each other (via
the Fisher information matrix) [13] in a way that other plausible measures (most notably $\ell_2$) are not. In
addition, the log-likelihood ratio $\ln(p/q)$ is a crucial parameter in Neyman-Pearson style hypothesis testing
[17], and distances based on this (like the KL-distance and the JS-distance) appear as exponents of error
probabilities for optimal classifiers. Recently, these distance measures have been used in more algorithmic
contexts, as natural distances for clustering distributional data [36, …]. Batu et al [12] gave algorithms
for testing closeness of distributions for the $\ell_1$ and $\ell_2$ distances, and raised the question of testing closeness
of distributions under the JS-divergence.

In this paper we provide optimal (up to constants) algorithms for testing $f$-divergences of distributions.
We consider the problem of estimating the entropy $H$ of a distribution, providing optimal (up to constants)
upper bounds for testing entropy. This improves the previous result of Batu et al [11] by a factor of $(\log n)/H$.
Entropy is naturally related to the JS-divergence since $\mathrm{JS}(p,q) = \ln 2\,(2H((p+q)/2) - H(p) - H(q))$,
where $(p+q)/2$ is the average of the two distributions.
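The identity between entropy and the JS-divergence can be checked numerically. The following Python sketch is illustrative only: it assumes the unnormalized definition $\mathrm{JS}(p,q) = \sum_i p_i \ln\frac{2p_i}{p_i+q_i} + q_i \ln\frac{2q_i}{p_i+q_i}$ with natural logarithms and base-2 entropy, and the function names are hypothetical.

```python
import math

def entropy_bits(p):
    """Shannon entropy in bits; zero-probability items contribute nothing."""
    return sum(x * math.log2(1.0 / x) for x in p if x > 0)

def js_direct(p, q):
    """Unnormalized JS divergence with natural logs (assumed definition)."""
    total = 0.0
    for a, b in zip(p, q):
        if a > 0:
            total += a * math.log(2 * a / (a + b))
        if b > 0:
            total += b * math.log(2 * b / (a + b))
    return total

def js_via_entropy(p, q):
    """Right-hand side of the identity: ln 2 * (2 H((p+q)/2) - H(p) - H(q))."""
    mid = [(a + b) / 2 for a, b in zip(p, q)]
    return math.log(2) * (2 * entropy_bits(mid) - entropy_bits(p) - entropy_bits(q))
```

Both functions agree up to floating-point error on any pair of distributions over the same base set.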

Switching from sublinear time to sublinear space, we then focus on the streaming model and develop a 

1 Many of the measures we consider in this paper are not metrics - and several authors use constant multiples of the definitions 
in this paper. Traditionally, the term 'divergence' has been used to distinguish such measures from distances and metrics. We will 
use the terms 'distance' and 'divergence' interchangeably; a distance is not a metric unless explicitly mentioned. 

2 We can easily verify that $f(u) = u \ln u$ gives us the KL divergence; $f(u) = \ln(2/(1+u)) + u \ln(2u/(1+u))$ gives us the
Jensen-Shannon (JS) divergence. The asymmetric Pearson's $\chi^2$ distance is realized with $f(u) = (u-1)^2$ and is symmetrized to the
Triangle distance with $f(u) = (u-1)^2/(u+1)$. Matsusita's divergence or the (squared) Hellinger distance has $f(u) = (\sqrt{u}-1)^2$.
The $\ell_1$ or variational distance is realized with $f(u) = |u-1|$.
one-pass polylogarithmic-space PTAS for estimating entropy in the random streams model, which assumes
that the input is a random permutation of some fixed multiset. We subsequently derive several (regular) 
streaming algorithms that give a three way tradeoff between space, approximation, and number of passes. 
We note that these algorithms naturally imply (weaker) tradeoffs in JS distance and omit further discussion. 

1.2 Models 

As it turns out, sublinear algorithms for testing distributions reveal interesting structure about the
relationship between property testing and stream algorithms. Feigenbaum et al [22] considered the problem of
property testing in a data stream model. They showed that there exist functions (e.g., SORTED-SUPERSET,
a variant of permutation [22]) that are easy in the property testing model but hard to test in streams. This
was surprising since many sampling-based techniques can be extended to data streams. For example,
Bar-Yossef et al [10] showed that non-adaptive sampling can be easily simulated in an aggregate (all occurrences
of item $i$ are grouped together) streaming model with a small blowup in space. The aggregation assumption
can be removed with an extra pass.

We show that in fact these (variants of permutations) are the only hard functions. Specifically, we show 
that any property testing algorithm for a permutation invariant (also known as symmetric) function, in the 
combined oracle model can be simulated by a single pass data stream algorithm that assumes a random 
permutation of the input. The random permutation assumption can be removed using an extra pass to give 
a two-pass simulation in the regular streaming model. The simulation builds upon the reductions used by
Bar-Yossef et al [8, 6, 7] in deriving strong lower bounds for sampling. However, we use the reductions for
upper bounds.

In a natural sense, if we exclude permutation dependent functions, stream testing in the random permu- 
tation model subsumes combined oracle property testing, and it is the testing of entropy that reveals this 
difference between the models. 

2 Definitions 

Definition 1. Let $p$ and $q$ be two discrete probability distributions defined on base $[n]$. The $f$-divergence
between $p$ and $q$ is defined as $D_f(p,q) = \sum_i p_i f(q_i/p_i)$ for some function $f$ (convex, $f(1) = 0$). Many
commonly used distance measures are $f$-divergences, including the $\ell_1$ distance, the Hellinger distance³
$\mathrm{Hellinger}(p,q) = \sum_i (\sqrt{p_i} - \sqrt{q_i})^2$, the Jensen-Shannon distance
$\mathrm{JS}(p,q) = \sum_i \left( p_i \ln\frac{2p_i}{p_i+q_i} + q_i \ln\frac{2q_i}{p_i+q_i} \right)$,
and the Triangle distance $\Delta(p,q) = \sum_i \frac{(p_i-q_i)^2}{p_i+q_i}$.
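As an illustration of Definition 1, the sketch below evaluates $D_f(p,q) = \sum_i p_i f(q_i/p_i)$ for the generators listed in the footnote on $f$-divergences. It is a minimal sketch, not a reference implementation: the names `GENERATORS` and `f_divergence` are hypothetical, and the code assumes every $p_i > 0$ so that the ratio $q_i/p_i$ is defined.

```python
import math

# Generators f with f(1) = 0, as listed in the text.
GENERATORS = {
    "l1": lambda u: abs(u - 1),
    "hellinger": lambda u: (math.sqrt(u) - 1) ** 2,
    "triangle": lambda u: (u - 1) ** 2 / (u + 1),
    "js": lambda u: math.log(2 / (1 + u))
    + (u * math.log(2 * u / (1 + u)) if u > 0 else 0.0),
}

def f_divergence(p, q, f):
    """D_f(p, q) = sum_i p_i f(q_i / p_i); this sketch requires every p_i > 0."""
    if any(pi <= 0 for pi in p):
        raise ValueError("sketch assumes p has full support")
    return sum(pi * f(qi / pi) for pi, qi in zip(p, q))
```

For these four generators the value is symmetric in its arguments, which matches the closed forms given in the definition.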

Definition 2. The entropy of a distribution is defined as $H(p) = \sum_i p_i \log\frac{1}{p_i}$. (All logs are base 2.)

Accurately estimating entropy is an interesting problem in its own right as well as in the context of the
statistical distances. Throughout we will make use of the following lemma.

Lemma 3 (Concentration of Independent Random Variables [25]). Let $\{X_t\}_{1 \le t \le m}$ be independently
distributed random variables with (continuous) range $[0,u]$. Let $X = \sum_{1 \le t \le m} X_t$. Then for $\gamma > 0$,

$\Pr(|X - E(X)| > \gamma E(X)) \le 2\exp\left(-\frac{\gamma^2 E(X)}{3u}\right).$

3 Property Testing 

3.1 Oracle Models for Property Testing of Distributions 

Two main oracle models have been used in the property testing literature for testing properties of
distributions. These are the generative and evaluative models introduced by Kearns et al [29]. The black-box or
generative model of a distribution permits only one operation: taking a sample from the distribution. In

3 Note that the Hellinger distance is sometimes defined as the square-root of the above quantity. 
other words, given a distribution $p = \{p_1, \ldots, p_n\}$, sample($p$) returns $i$ with probability $p_i$. In the
evaluative model, a probe operation is permitted: probe($p, i$) returns $p_i$. A natural third model, the combined
model, was introduced by Batu et al [11]. In this model both the sample and probe operations are
permissible. In all three models, the complexity of an algorithm is measured by the number of operations.
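To make the three models concrete, the following sketch wraps an explicit probability vector in an object that exposes sample and probe and counts every oracle call. The class name `CombinedOracle` and its interface are assumptions made for illustration; a generative or evaluative oracle is obtained by using only one of the two methods.

```python
import random
from bisect import bisect_right
from itertools import accumulate

class CombinedOracle:
    """Toy oracle over an explicit probability vector (illustrative only)."""

    def __init__(self, probs, seed=None):
        self.probs = list(probs)
        self._cdf = list(accumulate(self.probs))  # running sums for inversion sampling
        self._rng = random.Random(seed)
        self.calls = 0  # complexity measure: number of oracle operations

    def sample(self):
        """Generative operation: return index i with probability p_i."""
        self.calls += 1
        r = self._rng.random() * self._cdf[-1]
        return min(bisect_right(self._cdf, r), len(self.probs) - 1)

    def probe(self, i):
        """Evaluative operation: return p_i exactly."""
        self.calls += 1
        return self.probs[i]
```

Counting calls in one place makes it easy to compare an algorithm's cost against the sample/probe bounds stated in the theorems.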



3.2 Testing Jensen-Shannon, Hellinger and Triangle Divergences (Generative Oracle) 

In this section we consider property testing in the generative model for various information-theoretic
distances. We will present the results for the Triangle divergence $\Delta$. However, the Jensen-Shannon and
Hellinger divergences are constant-factor related to the Triangle divergence as follows:

$\mathrm{Hellinger}(p,q)/2 \le \Delta(p,q)/2 \le \mathrm{JS}(p,q) \le \ln 2 \cdot \Delta(p,q) \le 2\ln 2 \cdot \mathrm{Hellinger}(p,q)$

(Parts of this chain of inequalities are proved in [37].) Therefore the results presented here naturally imply analogous
results for them as well. Our algorithm is similar to that in [12] and is presented in Figure 1. It relies on an
$\ell_2$ tester given in [12]. Central to the analysis are the following inequalities.
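The chain of inequalities relating the Hellinger, Triangle and Jensen-Shannon divergences can be sanity-checked on random inputs. The sketch below is only a numerical check under the definitions of Definition 1 (natural logarithms for JS); the helper names are hypothetical and the tolerance guards against floating-point rounding in the cases where an inequality is tight.

```python
import math
import random

def hellinger(p, q):
    return sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))

def triangle(p, q):
    return sum((a - b) ** 2 / (a + b) for a, b in zip(p, q) if a + b > 0)

def jensen_shannon(p, q):
    total = 0.0
    for a, b in zip(p, q):
        if a > 0:
            total += a * math.log(2 * a / (a + b))
        if b > 0:
            total += b * math.log(2 * b / (a + b))
    return total

def chain_holds(p, q, tol=1e-12):
    """Check Hellinger/2 <= Delta/2 <= JS <= ln2*Delta <= 2*ln2*Hellinger."""
    h, d, js = hellinger(p, q), triangle(p, q), jensen_shannon(p, q)
    ln2 = math.log(2)
    return (h / 2 <= d / 2 + tol and d / 2 <= js + tol
            and js <= ln2 * d + tol and ln2 * d <= 2 * ln2 * h + tol)

def random_distribution(n, rng):
    """Normalize n uniform draws into a probability vector."""
    w = [rng.random() for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]
```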

Lemma 4 ($\ell_2$ Testing [12]). There exists an algorithm that, given distributions $p, q$, draws
$s = O(\log\frac{1}{\delta} \cdot (b^2 + \epsilon^2\sqrt{b})/\epsilon^4)$ samples (where $b = \max_i(p_i, q_i)$) such that if $\ell_2(p,q) \le \epsilon/2$ the algorithm passes with probability
at least $1-\delta$, but if $\ell_2(p,q) > \epsilon$ the algorithm passes with probability less than $\delta$.



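Algorithm $\Delta$-Test

1. Draw $m$ samples from $p$ and from $q$

2. $n_i^p$ = # times element $i$ appears in the sample from $p$, and $\hat{p}_i = n_i^p/m$

3. $n_i^q$ = # times element $i$ appears in the sample from $q$, and $\hat{q}_i = n_i^q/m$

4. Let $S = \{i : \max\{n_i^p, n_i^q\} > m n^{-\alpha}\}$

5. return "Fail" if $\sum_{i \in S} (\hat{p}_i - \hat{q}_i)^2/(\hat{p}_i + \hat{q}_i) > \epsilon/10$

6. Define $p'$ (and $q'$ analogously) as follows:

7. $i \leftarrow$ sample($p$)

8. if $i \notin S$, output $i$ else output $j$ uniformly chosen from $[n]$

9. return $\ell_2$-Tester run on $p'$, $q'$ and $\epsilon/\sqrt{n}$

Figure 1: $\Delta$-Testing in the Generative Model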



The first lemma follows from Chernoff bounds. The rest of the proofs are in Appendix A.

Lemma 5. We say an estimate is heavy if it is greater than l/n a . Then, with m = 0(log \ - ^ g - ) samples, 
with probability 1 — 6/2, for all heavy estimate pi, pi is at most pi~f /100 from p^ Furthermore, if at least 
one of pi or q~{ is heavy then we can estimate ^q^j up to j = ^ max fa' g ^ - 

Lemma 6 ($\Delta$ Testing). For two distributions $P$ and $Q$, there exists an algorithm drawing

$s = O\left(\log\frac{1}{\delta} \cdot \max\left\{\frac{n^\alpha \log n}{\epsilon^2},\ \frac{n^{-2\alpha+2} + \epsilon^2 n^{1-\alpha/2}}{\epsilon^4}\right\}\right)$

samples such that if $\Delta(P,Q) \le \epsilon^2/n^{1-\alpha}$, the algorithm passes with probability at least $1-\delta$, but if
$\Delta(P,Q) > \epsilon$ the algorithm passes with probability less than $\delta$.

Observe that setting $\alpha = 2/3$ yields an algorithm with sample complexity $\tilde{O}(n^{2/3}/\epsilon^4)$.
3.3 Testing all Symmetric Bounded $f$-Divergences (Combined Oracle)

In this section we consider property testing in the combined oracle model for all symmetric bounded
$f$-divergences. Recall that a convex function $f$ defines a divergence $D_f(p,q) = \sum_i p_i f(q_i/p_i)$; this encodes the
permutation invariance. We are interested in symmetric (over $p, q$) divergences, i.e., $D_f(p,q) = D_f(q,p)$.
We define a divergence to be bounded⁴ if $\max\{f(u), u f(1/u)\} = \tau < \infty$. JS, Hellinger, $\Delta$, and $\ell_1$ are all bounded
and satisfy $\tau = 2$. We show an interesting decomposition property of symmetric $f$-divergences. Define a
conjugate $f^*(u) = u f(1/u)$. It can be verified that JS, Hellinger, $\Delta$, and $\ell_1$ are self-conjugate, i.e., $f^*(u) = f(u)$.
One useful characterization of symmetric $f$-divergences is the following:

Lemma 7 ([31]). $D_f(p,q) = D_f(q,p)$ iff $f^*(u) = f(u) - c(u-1)$ for some constant $c$.

Therefore $f$ and $(f(u) + f^*(u))/2$ differ only by a multiple of $(u-1)$, and since $\sum_i p_i c(q_i/p_i - 1) = c(\sum_i q_i - \sum_i p_i) =
c(1-1) = 0$, we may as well use $g(u) = (f(u) + f^*(u))/2$. We now claim the following:

Lemma 8. Any symmetric divergence $D_f$ can be expressed as

$D_f(p,q) = \sum_{i: p_i > q_i} p_i\, g(x) + \sum_{i: q_i > p_i} q_i\, g(1/x)$

where $g(x) = \frac{1}{2}(f(x) + x f(1/x))$ and $x = q_i/p_i$. Further, if $f$ is bounded then $g(x) \le \tau$ if $x \in [0,1]$.

Although the above appears simple, it actually allows us to break the divergence into small, positive com- 
ponents. This allows us to use sharp concentration bounds. 

Algorithm Combined Oracle Distance Testing

1. $E \leftarrow 0$; for $t = 1$ to $m$:

2. do $i \leftarrow$ sample($p$) and $x$ = probe($q, i$)/probe($p, i$)

3. If $x < 1$ then $a \leftarrow g(x)$ else $a \leftarrow 0$

4. $j \leftarrow$ sample($q$) and $x$ = probe($q, j$)/probe($p, j$)

5. If $x > 1$ then $b \leftarrow g(1/x)$ else $b \leftarrow 0$

6. $E \leftarrow (a+b)/2\tau + E$

7. Estimate the distance as $2\tau E/m$

Figure 2: Combined Oracle Distance Testing 
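The estimator of Figure 2 can be sketched in a few lines of Python. This is a hedged illustration rather than the authors' implementation: the distributions are given as explicit lists (so probing is a list lookup), `g` is evaluated only on ratios in $[0,1]$ as in Lemma 8 so that each summand stays in $[0,1]$, and the names `estimate_divergence` and `hellinger_g` are hypothetical.

```python
import math
import random

def estimate_divergence(p, q, g, m, tau=2.0, rng=None):
    """Estimate D_f(p, q) from m rounds of sample/probe calls.

    g is the symmetrized generator of Lemma 8, needed only on [0, 1];
    tau is any upper bound on g over [0, 1].
    """
    rng = rng or random.Random()
    items = range(len(p))
    from_p = rng.choices(items, weights=p, k=m)  # sample(p), m times
    from_q = rng.choices(items, weights=q, k=m)  # sample(q), m times
    total = 0.0
    for i, j in zip(from_p, from_q):
        # Only ratios in [0, 1] reach g, so each summand stays in [0, 1].
        a = g(q[i] / p[i]) if q[i] < p[i] else 0.0
        b = g(p[j] / q[j]) if p[j] < q[j] else 0.0
        total += (a + b) / (2 * tau)
    return 2 * tau * total / m

def hellinger_g(x):
    """For Hellinger, f is self-conjugate, so g(x) = f(x) = (sqrt(x) - 1)^2."""
    return (math.sqrt(x) - 1) ** 2
```

For example, `estimate_divergence(p, q, hellinger_g, m=20000, rng=random.Random(7))` should land close to $\sum_i (\sqrt{p_i} - \sqrt{q_i})^2$ for small explicit `p` and `q`.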



Theorem 9. For a bounded symmetric $f$-divergence $D_f$ in the combined oracle model we can estimate
$D_f(p,q)$ up to a factor $(1+\epsilon)$ in $O(\tau/(\epsilon^2 D_f(p,q)))$ time. ($\tau = O(1)$ for $\ell_1$, Hellinger, $\Delta$, and JS.)

Proof. Consider the value added to $E$ in each iteration. This is a random variable with range $[0,1]$ and
mean $D_f(p,q)/(2\tau)$. Hence by Lemma 3, $\Pr\left(|E - m D_f(p,q)/(2\tau)| > \epsilon m D_f(p,q)/(2\tau)\right) \le 2\exp(-\epsilon^2 D_f(p,q) m/(6\tau))$.
Therefore with $O(\tau/(\epsilon^2 D_f(p,q)))$ samples/probes the probability that we do not estimate $D_f(p,q)$ as required
can be made arbitrarily small. □
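The tail bound in the proof translates directly into a concrete number of rounds: requiring $2\exp(-\epsilon^2 D_f m/(6\tau)) \le \delta$ gives $m \ge 6\tau \ln(2/\delta)/(\epsilon^2 D_f)$. The helper below is a small sketch of that arithmetic; the function name and the default failure probability are assumptions, and in practice $D_f$ is unknown, so a lower bound on it would have to be supplied.

```python
import math

def samples_needed(eps, divergence, tau=2.0, delta=0.05):
    """Smallest m with 2*exp(-eps^2 * D * m / (6*tau)) <= delta."""
    return math.ceil(6 * tau * math.log(2 / delta) / (eps ** 2 * divergence))
```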

Note that although $\ell_2$ is not an $f$-divergence, setting $a = p_i(1 - q_i/p_i)^2$ and $b = q_i(p_i/q_i - 1)^2$ in
the above we can estimate $\ell_2(p,q)$ in $O(1/(\epsilon^2 \ell_2(p,q)))$ time. It is worth mentioning that the above results
can be rephrased as an $O(1/\epsilon)$ algorithm if we are interested in distinguishing between the cases where the
distance is greater than $\epsilon$ or less than $\epsilon/2$.

We now prove a corresponding lower bound that shows that our algorithm is tight. Note that while it is
relatively simple to see that there exist two distributions that are indistinguishable with $o(1/\ell_1)$

4 Since $f$ is typically monotone, i.e., if the likelihood ratio is closer to 1, then the distance decreases, boundedness reduces
to requiring that $\lim_{u \to 0} f(u)$ and $\lim_{u \to \infty} f(u)/u$ exist and are at most $\tau < \infty$. Note that from continuity $\lim_{u \to 1} f(u) = f(1) = 0$.
oracle calls, it requires some work to also show a lower bound with a dependence on $\epsilon$. Further note that
the proof below also gives analogous results for JS, Hellinger and $\Delta$. (This is because for any pair of
probability distributions $p$ and $q$ such that $p_i \ne q_i \Rightarrow \min\{p_i, q_i\} = 0$, we have $\ell_1(p,q) = \Delta(p,q) = \mathrm{JS}(p,q) =
\mathrm{Hellinger}(p,q)$.)



Theorem 10 ($\ell_1$ Lower Bound in the Combined Oracle Model). Any approximation up to a $1+\epsilon$ factor
of the $\ell_1$ distance between two distributions in the combined oracle model requires $\Omega(1/(\epsilon^2 \ell_1))$ time.

Proof. Let $p$ and $q^r$ be the distributions on $[n]$ described by the following two probability vectors:

$p = (1 - 3a/2, \underbrace{3a\epsilon/2k, \ldots, 3a\epsilon/2k}_{k/\epsilon}, 0, \ldots, 0)$ and $q^r = (1 - 3a/2, \underbrace{0, \ldots, 0}_{r}, \underbrace{3a\epsilon/2k, \ldots, 3a\epsilon/2k}_{k/\epsilon}, 0, \ldots, 0)$



Then $\ell_1(p, q^{k/3\epsilon}) = a$ and $\ell_1(p, q^{k/3\epsilon + k}) = a(1+3\epsilon)$. Hence to $1+\epsilon$ approximate the distance between $p$
and $q^r$ we need to distinguish between the cases when $r = k/3\epsilon$ ($=: r_1$) and $r = k/3\epsilon + k$ ($=: r_2$). Consider
the distributions $p'$ and $q'$ formed by arbitrarily permuting the base sets of $p$ and $q^r$. Note that the $\ell_1$
distance remains the same. We will show that, without knowledge of the permutation, it is impossible to
estimate this distance with $o(1/(\epsilon^2 a))$ oracle calls. We reason this by first disregarding the value of any
"blind probes", i.e., a probe probe($p', i$) or probe($q', i$) for any $i$ that has not been returned as a sample.
This is the case because, by choosing $n \gg k/(a\epsilon^2)$ we ensure that, with arbitrarily high probability, for any
$o(1/(\epsilon^2 a))$ set of $i$'s chosen from any $n - o(1/(a\epsilon^2))$ sized subset of $[n]$, $p'_i = q'_i = 0$. This is the case
for both $r_1$ and $r_2$. Let $I = \{i : p'_i \text{ or } q'_i = 3a\epsilon/2k\}$ and $I_1 = \{i \in I : p'_i \ne q'_i\}$. Therefore determining
whether $r = r_1$ or $r_2$ is equivalent to determining whether $|I_1|/|I| = 1/2$ or $1/2 + 9\epsilon/(8+6\epsilon)$. We may assume
that every time an algorithm sees $i$ returned by sample($p'$) or sample($q'$), it learns the exact values of
$p'_i$ and $q'_i$ for free. Furthermore, by making $k$ large ($k = \omega(1/\epsilon^3)$ suffices) we can ensure that no two
sample oracle calls will ever return the same $i \in I$ (with high probability). Hence distinguishing between
$|I_1|/|I| = 1/2$ and $1/2 + 9\epsilon/(8+6\epsilon)$ is analogous to distinguishing between a fair coin and a $\Theta(\epsilon)$-biased
coin. It is well known that the latter requires $\Omega(1/\epsilon^2)$ samples. Unfortunately only an $O(a)$ fraction of samples return
an $i \in I$, since with probability $1 - 3a/2$ we output an $i \notin I$ when sampling either $p'$ or $q'$. The bound
follows. □
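The arithmetic behind the hard instance can be verified with exact rational numbers. The sketch below builds $p$ and $q^r$ for small concrete parameters (chosen here purely for illustration) and checks the two $\ell_1$ distances and the ratio $|I_1|/|I|$; the function names are hypothetical, and the construction assumes $k/\epsilon$ is an integer, $r \le k/\epsilon$, and $n \ge 1 + r + k/\epsilon$.

```python
from fractions import Fraction

def hard_instance(a, eps, k, r, n):
    """Return (p, q_r) as lists of Fractions for the lower-bound construction."""
    run = Fraction(k) / eps
    if run.denominator != 1 or r > run or 1 + r + run > n:
        raise ValueError("need integer k/eps, r <= k/eps and n >= 1 + r + k/eps")
    run = int(run)
    small = 3 * a * eps / (2 * k)   # mass of each light element
    head = 1 - 3 * a / 2            # mass of the single heavy element
    zero = Fraction(0)
    p = [head] + [small] * run + [zero] * (n - 1 - run)
    q = [head] + [zero] * r + [small] * run + [zero] * (n - 1 - r - run)
    return p, q

def l1(p, q):
    return sum(abs(x - y) for x, y in zip(p, q))

def disagreement_ratio(p, q, small):
    """|I_1| / |I| with I the light support and I_1 the positions where p, q differ."""
    support = [i for i in range(len(p)) if p[i] == small or q[i] == small]
    differ = [i for i in support if p[i] != q[i]]
    return Fraction(len(differ), len(support))
```

With $a = \epsilon = 1/10$ and $k = 30$, the two cases $r_1 = 100$ and $r_2 = 130$ give distances $a$ and $a(1+3\epsilon)$ and ratios $1/2$ and $1/2 + 9\epsilon/(8+6\epsilon)$ respectively.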

3.4 Testing Entropy (Combined Oracle) 

In this subsection we show that a simple algorithm achieves the optimal bounds for estimating the entropy in
the combined oracle model of property testing. Note that this simple algorithm improves upon the previous
upper bound of Batu et al [11] by a factor of $\log n/H$ where $H$ is the entropy of the distribution. The authors
of [11] showed that their algorithms were tight for $H = \Omega(\log n)$; we show that the upper and lower bounds
match for arbitrary $H$. The algorithm is presented in Figure 3. It is structurally similar to the algorithm
given in [11] but uses a cutoff that will allow for a much tighter analysis via Chernoff bounds.



Algorithm Combined Oracle Entropy Testing

1. $E \leftarrow 0$; for $t = 1$ to $m$:

2. do $i \leftarrow$ sample($p$)

3. $p_i \leftarrow$ probe($p, i$)

4. if $p_i \ge 1/n^3$ then $a \leftarrow \frac{\log(1/p_i)}{3 \log n}$ else $a \leftarrow 0$

5. $E \leftarrow a + E$

6. Estimate the entropy as $3E \log n/m$



Figure 3: Combined Oracle Entropy Testing 
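A direct transcription of Figure 3 into Python might look as follows. It is a sketch under simplifying assumptions: the distribution is an explicit list with $n \ge 2$ entries, probing is a list lookup, and the name `estimate_entropy` is hypothetical.

```python
import math
import random

def estimate_entropy(p, m, rng=None):
    """Estimate H(p) in bits from m sample/probe rounds, with the 1/n^3 cutoff."""
    rng = rng or random.Random()
    n = len(p)
    cutoff = n ** -3
    log_n = math.log2(n)
    total = 0.0
    for i in rng.choices(range(n), weights=p, k=m):  # i <- sample(p)
        pi = p[i]                                    # p_i <- probe(p, i)
        # The cutoff keeps every summand inside [0, 1].
        total += math.log2(1.0 / pi) / (3 * log_n) if pi >= cutoff else 0.0
    return 3 * total * log_n / m
```

On a small distribution such as `[0.5, 0.25, 0.125, 0.125]`, whose entropy is 1.75 bits, a few thousand rounds already give an estimate close to the true value.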

The next lemma estimates the contribution of the unseen elements and that leads to the main theorem about 
estimating entropy in the combined oracle model. 



Theorem 12. In the combined oracle model we can $1+\epsilon$ estimate the entropy $H(p)$ of a distribution using
$O(\log n/(\epsilon^2 H))$ samples and probes, where $H$ is the entropy of the distribution.

Proof. We restrict our attention to the case when $H(p) > 1/n$ and $\epsilon > 1/\sqrt{n}$ since otherwise we can
trivially find the entropy exactly in $O(1/(\epsilon^2 H(p)))$ time by simply probing each of the $n$ $p_i$'s. Consider
the value $a$ added to $E$ in each iteration. This is a random variable with range $[0,1]$ since $p_i \ge 1/n^3$
guarantees that $\frac{\log(1/p_i)}{3\log n} \le 1$. Now, the combined mass of all $p_i$ such that $p_i < 1/n^3$ is at most $1/n^2$.

Hence since l/n 2 < \ 0& n-\og'f/2H{p) ^ lemma the maximum contribution to the entropy from such 
$i$ is at most $\epsilon H(p)$. Hence the expected value of $a$ is between $(1-\epsilon/2)H(p)/(3\log n)$ and $H(p)/(3\log n)$, and
therefore, if we can $1+\epsilon/2$ approximate $E(a)$ then we are done. We use the probability concentration lemma
(Lemma 3) to get that $\Pr(|E - m E(a)| > (\epsilon/2) m E(a)) \le 2e^{-(\epsilon/2)^2 m E(a)/3}$. Therefore with $O(1/(\epsilon^2 E(a))) =
O(\log n/(\epsilon^2 H(p)))$ samples/probes the probability that we do not $1+\epsilon/2$ approximate $E(a)$
can be made arbitrarily small. □

4 Data Streams 

4.1 The Data Streams Model 

The data stream model characterizes small space algorithms that can access the read-only input in order. 
The algorithm makes passes over the input; any item not explicitly stored is inaccessible to the algorithm in 
the same pass. In many cases the number of passes is limited to one. The crucial aspect of a (regular) data 
stream algorithm is that the algorithm is required to produce a correct output for an arbitrary permutation of 
the input stream. 

As mentioned in the introduction, a random stream algorithm is a data stream algorithm that reads a
randomly permuted input from its read-only input tape. Alternate definitions are possible, but this definition
dates back to Munro and Paterson [33] and we will restrict ourselves to it. (It also appears
in [20].) All other features are the same as a general stream algorithm. As usual, the complexity of the
algorithm is measured primarily in terms of the amount of space used on the work tape (for which the algorithm
has random read-write access).

Modeling stream distributions: There are two ways in which a data stream can be considered to define
a probability distribution $p$. These are the update data stream model and the aggregate data stream model.
Firstly we discuss the update data stream model. We are given a base domain $[1, \ldots, n]$ over integers and
a function $f_p()$ is specified as $\langle p, i, + \rangle$ which corresponds to $f_p(i) \leftarrow f_p(i) + 1$. This is the model used by
Alon, Matias, and Szegedy in [2]. The model naturally captures $f_p(i) \leftarrow f_p(i) + \Delta$; however, we do not
consider $f_p(i) \leftarrow f_p(i) - 1$ (deletions) since the negative term does not correspond to any operation over
distributions.

An alternate model is the aggregate model where the input is $\langle p, i, f_p(i) \rangle$. This is the model used by
Feigenbaum et al [23] for $\ell_1$ differences⁵. In this model, computing the entropy is trivial and the Hellinger
distance reduces to computing the $\ell_2$ norm⁶. Note that this implies that we can compute the Jensen-Shannon

5 Later generalized by Indyk [26] to the update model.

6 Suppose we are given streams $X$ and $Y$ representing distributions $P$ and $Q$ respectively. Let $f_p(i), f_q(i)$ be the number of
occurrences of item $i$ in the two streams and let $m_x = \sum_i f_p(i)$, $m_y = \sum_i f_q(i)$ be the lengths of the streams respectively. The
(squared) Hellinger distance is the squared $\ell_2$ norm of the difference of vectors $u, v$ where $u_i = \sqrt{f_p(i)/m_x}$, $v_i = \sqrt{f_q(i)/m_y}$. This
difference can be easily computed by sketching (à la Johnson-Lindenstrauss [28], see [26]), maintaining the product $Au$ (and scaling
by $\sqrt{m_x}$ at end of input).






divergence and the Triangle divergence up to a constant factor as well. Obtaining a PTAS for them remains
an interesting open question even in this simpler model. However, the aggregate model is restrictive for
distributions, since the aggregation loses the "distribution" aspect. We will focus on the update model, and,
as we argued above, will only consider insertions.
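The reduction of the Hellinger distance to an $\ell_2$ computation in the aggregate model can be illustrated as follows. This sketch only demonstrates the reduction: it stores both frequency vectors exactly instead of maintaining a small-space linear sketch, and the function name and the (item, count) input format are assumptions.

```python
import math

def hellinger_from_aggregate(stream_x, stream_y):
    """Each stream is an iterable of (item, count) pairs, one pair per item."""
    fx, fy = dict(stream_x), dict(stream_y)
    mx, my = sum(fx.values()), sum(fy.values())
    # u_i = sqrt(f_p(i)/m_x), v_i = sqrt(f_q(i)/m_y); the distance is ||u - v||_2^2.
    return sum(
        (math.sqrt(fx.get(i, 0) / mx) - math.sqrt(fy.get(i, 0) / my)) ** 2
        for i in set(fx) | set(fy)
    )
```

A small-space version would replace the two dictionaries by a random linear projection of $u - v$, which is the sketching step the text refers to.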

Random Streams and two distributions: When we are computing a function of two distributions $p$ and $q$
we also have a function $f_q()$ specified by data items $\langle q, i, + \rangle$. Note that, for a random stream algorithm, we
consider the random permutation to be over $\langle q, i, + \rangle$ and $\langle p, i, + \rangle$ together. We will assume that $\sum_i f_p(i)$
and $\sum_i f_q(i)$ are within a constant factor of each other.

4.2 Simulating the Combined Oracle Models 

In the next section we will discuss how algorithms that make (combined) oracle calls may be simulated in 
the various streaming models. In particular this leads to the following theorem. 

Theorem 13. In two passes of a regular stream, there exist algorithms that,

1. $(1+\epsilon)$ approximate the entropy $H$ using $\tilde{O}(\frac{1}{\epsilon^2 H})$ space.

2. $(1+\epsilon)$ approximate a $\tau$-bounded symmetric $f$-divergence $D_f$ using $\tilde{O}(\frac{\tau}{\epsilon^2 D_f})$ space.

If the stream is randomly ordered then one pass suffices in each case.
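One plausible way to picture the two-pass simulation is sketched below; it is our own illustration under stated assumptions, not the construction used in the proof. The first pass draws $t$ independent uniform positions from the stream by reservoir sampling, which plays the role of $t$ sample calls; the second pass counts the exact frequencies of just those items, which plays the role of the matching probe calls. The function name and the `stream_factory` callable (which must return a fresh iterator over the same stream) are hypothetical.

```python
import random
from collections import Counter

def two_pass_oracle(stream_factory, t, rng):
    """Return t (item, empirical probability) pairs using two passes."""
    # Pass 1: t independent size-1 reservoirs, i.e. sampling with replacement.
    picks = [None] * t
    length = 0
    for item in stream_factory():
        length += 1
        for s in range(t):
            if rng.randrange(length) == 0:  # keep the new item with prob 1/length
                picks[s] = item
    # Pass 2: exact counts, but only for the sampled items.
    wanted = set(picks)
    counts = Counter()
    for item in stream_factory():
        if item in wanted:
            counts[item] += 1
    return [(i, counts[i] / length) for i in picks]
```

The space used is proportional to $t$ (plus counters), independent of the number of distinct items in the stream.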

Unfortunately, it is sometimes unrealistic to assume more than a single pass over the data. Hence we now
concentrate on single pass algorithms. In what follows we present a single pass, polylog space, asymptotic
constant factor approximation. We then briefly discuss a single pass algorithm that uses
0(n a ) space and achieves a (1 + e)^ approximation. Finally we present one pass algorithm for the random 
streaming model whose memory needs are not in terms of $1/H$.

4.3 An Asymptotic Approximation Scheme for Estimating Entropy in Regular Streams 

To construct our algorithm we will use algorithms for approximating $F_0$. There has been a long history
of papers on computing the frequency moments of streams. We focus our attention on the best known
$(\epsilon, \delta)$ approximation algorithm of Bar-Yossef et al [9], where $F_0$ is approximated up to a factor $(1+\epsilon)$ with
probability $1-\delta$. Their result shows that the $(\epsilon, \delta)$ approximation can be performed in $O((\frac{1}{\epsilon^2} \log\log n +
\log n) \log\frac{1}{\delta})$ space. We will only focus on the fact that the space bounds are poly-logarithmic. The basic
intuition of the algorithm is similar to the sublinear time minimum spanning tree algorithm of Chazelle
et al [15] and the streaming geometry algorithms by Indyk [27]. The idea is to count objects at various
resolutions.

Our algorithm works by randomly generating conceptual sub-streams from the data stream. Each
sub-stream has an associated level $j$ and we will perform the random generation of a sub-stream of level $j$ in such
a way that we only expect elements $i$ with $p_i > 2^{-j}$ to appear in the stream. We will feed each sub-stream
into an algorithm for estimating the number of distinct elements in the sub-stream. Summing up these
estimates (appropriately scaled) will give our estimate. The net result will be an asymptotic approximation
scheme, i.e., one that applies for $H$ sufficiently large (but constant)⁷. The algorithm is presented in Figure 4.
The centerpieces of the algorithm are the following two lemmas, which are proved in the Appendix. The
bounds follow from the accuracy of our guess $\hat{m}$ for $m$. Let $X_{ij}$ be the event that item $i$ showed up in level
$j$. Note that the $X_{ij}$ are independent.
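The level-based idea can be illustrated with a simplified simulation. Everything specific in this sketch is an assumption made for illustration: each occurrence is included in level $j$ independently with probability $\min(1, 2^j/m)$ using the true stream length $m$ (no guessing of $\hat{m}$ and no parameter $t$), distinct elements are counted exactly with sets rather than with an $F_0$ sketch, and the level counts $f_j$ are combined as $\sum_j f_j/2^j$. Under these choices an element with probability $p_i$ contributes roughly $p_i \log(1/p_i)$ plus a constant multiple of $p_i$, so the output tracks the entropy up to an additive constant.

```python
import math
import random

def level_estimate(stream, rng):
    """Subsample the stream at geometric rates and combine distinct counts."""
    m = len(stream)
    k = math.ceil(math.log2(m))           # number of levels (assumed choice)
    seen = [set() for _ in range(k + 1)]  # exact stand-in for F_0 sketches
    for item in stream:
        for j in range(1, k + 1):
            # Level j keeps each occurrence with probability about 2^j / m,
            # so items with p_i well below 2^-j rarely appear in it.
            if rng.random() < min(1.0, 2 ** j / m):
                seen[j].add(item)
    return sum(len(seen[j]) / 2 ** j for j in range(1, k + 1))
```

On a uniform stream over 64 items (entropy 6 bits), the average output over a few runs stays within a small additive constant of 6.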

Lemma 14. Ifm = Q.(^),p^ < 1 andt > (1 + e) 2 then (1 - \)^ < Pr[ Xij = 1] < (1 + e) 2 ^. 

7 This is in the same spirit as the approximation algorithms for bin-packing: checking for packing in 2 bins is NP-hard, but we
have a PTAS when the number of bins is large.
Algorithm Entropy-Estimation

1. Guess $\hat{m} = 1, (1+\epsilon), (1+\epsilon)^2, \ldots$ where $\hat{m}$ is roughly the number of insertions.

2. Initially $m = 0$; we will keep track of $m$.

3. for each guess of $\hat{m}$

4. do for each stream item (i, +) 

5. do For j = l,...k, where k ~ log j, toss a coin with probability (independent 

across j) and include i in sub-stream j. t ~ (1 + e) 2 is almost 1. 

6. Maintain appropriate structure for computing $F_0$ in each of these $k$ sub-streams.

7. Choose the set of structures corresponding to the guess $(1+\epsilon)\hat{m} \ge m \ge \hat{m}$.

8. For this $\hat{m}$, for each $j$ estimate the number of distinct items $f_j = F_0(j)$.

9. Output V ; ./- V. 

Figure 4: The $F_0$ Algorithm for Estimating Entropy in a Data Stream



Lemma 15. \(1-\)(H-1) < E 

We assume that the length of the stream is m n/e but that m is polynomial in n. Now observe that 
Y2j fj/^ is a sum °f many (bounded) Bernoulli variables. We can easily apply Chernoff and show with 
probability 0(exp(-e 2 c H)) the sum £\ E [(1 -§)(#- 1)(1 - e c ), (f (1 + e) 2 # + 2)(1 + e c )] . We 
need to repeat the above for O (log n/e c H) times for high probability. Now, we lose a further 1 ± 6q factor 
in the Fo estimation. Overall, we maintain 0(log n/e) different Fo estimation structure, each of which takes 
space 0((1/6q) log log n + log n) log n and succeed with high probability. Finally there are log n/e guesses 
for rh. The overall space bound is 0((log 3 n/(ee c H))(-^ loglogn + logn)). 

Theorem 16. We have a polylog space asymptotic + e approximation for the entropy. 

A natural question is whether the above analysis is tight. It is possible to slightly improve the bounds in Lemma 15,
but the problem is that there will always be a constant shift between the entropy and our estimate. There is a
natural bias in the estimation of entropy and this particular method alone is unlikely to yield better results.

4.4 A True PTAS (at a Price) for Estimating Entropy in Regular Streams 

A final natural question remains, namely, can we achieve a PTAS for entropy in the regular data streams 
model? The answer is yes, but, it comes with a price. The space bound for ^ approximation increases 
over the space bound of the combined oracle model by a factor of approximately $n^\alpha/H$. The algorithm and
its analysis are presented in Appendix B. This algorithm is roughly analogous to the algorithm presented in
Section 4.5. The difference lies in the way it divides the elements into "large" and "small" classes. It proceeds
in two steps: (i) it finds the elements with large $f(i)$ and estimates them, and (ii) it uses a worst-case bound for
small $f(i)$. The first step is achieved by a careful sampling technique reminiscent of the online facility
location algorithm of Meyerson [32] in the context of stream clustering. There are also some similarities
with the count/count-min sketches of [14, 16].

Theorem 17. We can estimate the entropy H upto a factor + e') using space where f(e', H) 

is the value of e satisfying, e' = a ( €lo s( n M+ n — ) _|_ ^g^ 1 A _ 

4.5 A PTAS in the Random Streams Model 

Emulating the combined oracle algorithm is only inefficient when the entropy is very small. But when the
entropy is small, it is easy to see that there must be one element with probability mass close to 1. The high level



E 



2' 



< £^t£l H + 2 where H is the entropy. 
idea of our algorithm is to keep track of the exact counts of the elements of a certain set $A$, and establish that
the projection of the distribution (rescaled to 1) to the complement of $A$ has large entropy. On this projection
we run the simulation of the combined oracle algorithm. However, we have several problems: (a) when we
discover that the entropy is large we already have a few elements from the projection, so we have a dependence;
(b) the projected distribution may again have one element with mass $\approx 1$; and (c) we will not know the exact
probability mass of any set until the end of the input. But,

Proposition 18. In the random stream model, if we consider the first s elements which do not belong to A,
we have a random sample (without replacement) from the distribution projected to $[n] \setminus A$.

Using Proposition 18 we can show that either estimates are sufficient for (b) and (c), or the stream has
few distinct elements, which we count exactly. At the end of the input we will know the exact probability mass
of A and rescale/shift to get the contribution of the projection to the original distribution. The proof of the
next lemma is in Appendix A.

Lemma 19. We can eliminate the dependence between MS and the simulated property testing algorithm. 

The algorithm: At any point in time we maintain a set of items A. For each $i \in A$ we count
the number of occurrences of i. We keep reading elements from the stream until we accumulate a multi-set MS
of size $c_1 \log n$ of elements that do not belong to A and appear earliest in the stream, for some large constant
$c_1$. We check if any $i' \in MS$ occurs at least ^ log n times in MS.

• If there is no such i' (the projection has large entropy), we proceed to simulate the t queries with the
correction due to MS as described above (maintaining counts of the elements in A as well).

• If there is any such i', let $A \leftarrow A \cup \mathrm{distinct}(MS)$. We operate on the rest of the stream with this new
A. Observe that the invariant of maintaining the counts of the elements in A can be maintained.

We also store the last $O(\epsilon^{-2} \log n)$ elements of the stream (also a multi-set) seen at all times.
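The set-growing loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `grow_tracked_set`, the use of a plain Python iterable as the stream, and the parameters `sample_size` (standing in for $c_1 \log n$) and `heavy_threshold` (standing in for the occurrence threshold, whose constant is treated here as a tunable assumption) are all hypothetical. The buffer of the last $O(\epsilon^{-2} \log n)$ elements and the simulation phase itself are omitted.

```python
from collections import Counter

def grow_tracked_set(stream, sample_size, heavy_threshold):
    """Sketch of the phase that grows the tracked set A.

    Returns (counts_a, ms, consumed): exact counts for A, the final
    multi-set MS (or None if the stream ended first), and the number of
    stream items read so far.
    """
    counts_a = {}   # exact counts of the elements of A
    ms = []         # earliest elements seen that do not belong to A
    consumed = 0
    for item in stream:
        consumed += 1
        if item in counts_a:
            counts_a[item] += 1          # invariant: exact counts for A
            continue
        ms.append(item)
        if len(ms) < sample_size:
            continue
        ms_counts = Counter(ms)
        if max(ms_counts.values()) < heavy_threshold:
            # No heavy item: the projection has large entropy, so the
            # caller would now switch to the simulation phase.
            return counts_a, ms, consumed
        # Some item is heavy: A <- A ∪ distinct(MS). The counts gathered
        # in MS are carried over, so the counts for A remain exact.
        counts_a.update(ms_counts)
        ms = []
    return counts_a, None, consumed
```

For example, on a stream of eight copies of one item followed by distinct items, the first full MS is dominated by the repeated item, which moves into A; the next MS has no heavy item and the loop hands over to the simulation.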

Note that we can assume that the remaining stream is much larger than MS: we maintain the last $O(\epsilon^{-2} \log n)$
elements and would discover within our allotted space that the remaining stream is small. But in that event
we would have an exact description of the distribution of the items in the stream. The next lemma follows
from the assumption, Proposition 18 and the Chernoff bound.

Lemma 20. Assuming that the remaining stream is much larger than MS, if there is an item in the projection 
with mass > \, then it occurs at least | fraction in MS with high probability polynomially close to 1. 

The next theorem follows from the fact that either we take out probability mass quickly, or we shift to the simulation
phase. The proof is in Appendix A.

Theorem 21. We can compute a $(1+\epsilon)$-approximation for the entropy in the random stream model using
$O((\log n + \epsilon^{-2}) \log n)$ space.

We can tighten the above by noticing that if the number of distinct elements in MS is r + 1 then the entropy is
$\Omega(r / \log n)$ and we can shift to the second phase. We omit the discussion.

We can view the setting as a "robust distribution" as in Gilbert et al. [24].
5 Connecting Oracle Models and Streaming Models

We direct the reader to [6] for a detailed treatment of the relative computational power of the data stream,
sketch (see [6] for a definition) and generative sampling models. Here we restrict ourselves to comparing
the combined oracle model with the streaming model.

We say that a function $f(p) = f(p_1, p_2, \ldots, p_n)$ is symmetric (or permutation-invariant) if f remains
unchanged when its arguments are permuted. (Symmetry is a desirable and often-assumed property of functions
on distributions, and is a special case of general invariance under coordinate reparametrizations [13].)
We will show that we can always express the algorithm in a canonical form where the algorithm first samples
and then probes the samples along with a few other elements. The idea is to view the original algorithm,
after the sampling stage and the probing of the samples, as a randomized decision tree, to rewrite it as
an oblivious decision tree along the lines of Bar-Yossef et al. [10], and to simulate this new decision tree
in the random stream model. We start with the necessary definitions.

Definition 22. A randomized decision tree that computes a function f(x) is defined (as usual) as a decision
tree having three types of nodes: a query node, which asks for the value of an input parameter and maps the
resulting value (and the history of all queries up to that point) to a choice of child node to visit; a random
choice node, where the child node is chosen at random; and output nodes, where an answer expressed as a
function of all queries thus far is returned.

Definition 23. An oblivious decision tree is one where the queries are made independently of the input and of the
random choices in the algorithm. Formally, suppose we have a tree T with worst-case query complexity u.
Then an I-relabeling of T by $I = \{i_1, \ldots, i_u\}$ relabels all query nodes of depth j by the query to $i_j$, yielding
the tree $T^I$. An oblivious decision tree is then a pair $(T, A_u)$, where T is a decision tree with worst-case
complexity u and $A_u$ is a distribution on $[n]^u$. A computation on an oblivious decision tree consists of two
steps: (1) sample u elements I from $A_u$; (2) relabel T to $T^I$ and run it on input x.
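The two-step computation in Definition 23 can be illustrated with a small sketch. The names `run_oblivious` and `decide` are hypothetical, and the tree is represented (as an assumption made only for this illustration) by a callback that sees the answers received so far and either stops with an output or asks for one more query, without ever choosing which index is queried. The index tuple is drawn from the uniform-without-replacement distribution.

```python
import random

def run_oblivious(decide, x, u, rng=None):
    """Run a decision procedure whose query indices are fixed in advance.

    decide(answers) returns (True, output) to stop, or (False, None) to
    request one more query; it never selects the index being queried.
    """
    rng = rng or random.Random()
    labels = rng.sample(range(len(x)), u)    # step (1): draw I = (i_1..i_u)
    answers = []
    for j in range(u):                       # step (2): run the relabeled tree
        done, output = decide(answers)
        if done:
            return output
        answers.append(x[labels[j]])         # the depth-j query goes to i_j
    done, output = decide(answers)
    if not done:
        raise ValueError("decide must produce an output after u queries")
    return output
```

Because the indices are sampled before any answer is seen, the same index tuple could equally be read off a stream prefix, which is what makes this form convenient for simulation.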

The first lemma shows how any combined oracle tester can be transformed (with only a slight blow-up
in complexity) into one of a canonical form. The proof is deferred to Appendix A.

Lemma 24 (Canonical Form Algorithm). Let A be a combined oracle property testing algorithm that
$(1+\gamma)$-estimates a symmetric function f(p) using (worst-case) t oracle calls and probability of error $\delta$. Then
there exists a canonical algorithm A' that uses (worst-case) O(t) oracle calls with equal performance.

We are now ready to state the main structural result of this section. The proof is deferred to Appendix A.
The main idea for simulating in the two-pass regular stream model is to sample in the first pass and then do exact
counting in the second pass. For the random order stream result we are able to do both the sampling and the
exact counting in the same pass by using, roughly speaking, the prefix of the random order stream as a
source for sample oracle queries.

Theorem 25. Let p be the probability distribution described by an update data stream. Let A be a combined
oracle property testing algorithm that makes at most t oracle calls to $(1+\gamma)$-estimate a symmetric function
f(p) with probability of error at most $\delta$. Then there exist a single-pass random stream algorithm and a
two-pass regular stream algorithm that use $O(t \log m \log n)$ space with equal performance.

The proof can be generalized to the case when we are computing a function of two distributions f(p, q),
e.g., a distance between two distributions. In this case we consider f as a function over n tuples, i.e., $f(p, q) =
f((p_1, q_1), (p_2, q_2), \ldots, (p_n, q_n))$. f is symmetric if it is invariant under permutations of the n tuples. The only
important caveat is that we need $\sum_i f_p(i) = \Theta(\sum_i f_q(i))$ so that, with high probability, there are t
elements of the form (p, i, +) (for some i) and t elements of the form (q, i, +) (for some i) in the first O(t)
data items.
A Miscellaneous Proofs

Proof of Lemma 5. Wlog, let $\tilde{p}_i \ge \tilde{q}_i$. Let $\tilde{p}_i = (1+\gamma_1)p_i$, $\tilde{q}_i = (1+\gamma_2)q_i$ and $\gamma' = \gamma/100$.
Wlog, $p_i \ge q_i$. Now if $p_i, q_i \ge n^{-\alpha}$ we know that $|\gamma_1|, |\gamma_2| \le \gamma'$ and thus we are bounded above by $p_i\gamma/10$.
The problem is that if $q_i \ll n^{-\alpha}$ then $|\gamma_2|$ could be very large. However, in this case $q_i/p_i$ will be very small,
so we can still bound the above well. Formally, let $q_i = \lambda_1 n^{-\alpha}$ and $n^{-\alpha} = \lambda_2 p_i$. Note that $0 < \lambda_1 < 1$ and
$0 < \lambda_2 < (1+\gamma)$. Now, note that if m samples are sufficient to estimate a probability of size $n^{-\alpha}$ up to
an additive error $\gamma' n^{-\alpha}$, then they are sufficient to estimate a probability of size $\lambda_1 n^{-\alpha}$ up to an additive error of
$\gamma'\sqrt{\lambda_1}\, n^{-\alpha}$, i.e., $\gamma_2 \le \gamma'/\sqrt{\lambda_1}$. Hence we get

| (2 72 + 72 2 )(A 1 A 2 ) 2 | + |2(AiA 2 )(7i+ 72 + 7172)1 + |(27i + 7i 2 )l + + 

< (2 7 'Ai /2 A^ + 7 2 AiA| + 2 7 'AiA 2 + 27'yATA 2 + 2 7 2 y^Aa + 3 7 ' + 2 7 ' 2 

< (2 7 '(1 + 7 ') 2 + 7 ,2 (1 + 7) 2 + 27 (1 + 7 ) + 27 (1 + V) + 27 2 (1 + 7 ) + 37 + 27 2 

< 7/10 



for small 7. Hence we again get that %-crH < K7/10. □ 

Proof of Lemma®. Let A = YlieS ^p~+q- an< ^ & = Si^s ^ P p~+q- • By Lemma|5]we estimate A with an 
additive error of e/ 10. Now, maxdlp'Hoo, H^'Hoo) < n~ Q (l + e). If A(p,q) > e then either A is bigger than 
e/2 or B is bigger than e/2. If j4 is bigger than e/2 then our estimate of A is bigger than e(l/2— 2/10) and in 
which case 

£. gS > e /io and we fail. Otherwise if is bigger than e/2 then i%(jpl ', g') > -4=^ and 

the £2 test fails since ^/nP^ip' , <?') > e/2- Alternatively if A(p, g) < e 2 /n l ~ a then ^4 < e 2 /n 1 ~ a and we 
don't fail at the first hurdle. Also 1? < e 2 /n 1 ^" implies the second test passes since n Q ^ 2 (p', g') < e 2 /n 1 ~ a 
and thus ^(p', g') < (appealing to Eq.[l]) □ 

Proof of Lemma 19. We use ideas from the simulation in Theorem 25. We want to find $t = O(\epsilon^{-2} \log n)$
elements from the projection at random with replacement ($H^{-1}$ is constant). Let MS be the multi-set of
items which allowed us to conclude that the entropy is large. We keep the subsequent counts of all the
elements in MS. We also look at the first t elements of the projection (irrespective of membership in MS),
denoted by Prefix, and keep track of the new elements seen. At the end of the stream, we know the length
m, and can now simulate the combined oracle algorithm using MS and Prefix. The first element of
Prefix is almost a random sample, with a corrective term that depends on the size of MS and m. If there
is a correction, then we choose a random element of MS. If we do not correct, we output the first element
of Prefix, add it to MS, shift Prefix to the next element and repeat the process. □

Proof of Theorem 21. If there is no such i', we are guaranteed (whp, with high probability) that the entropy
H of the projection satisfies $H > 1$, and the simulation is successful whp as well. If there is such an i', then
whp we decrease the mass of the projection by a factor of 2. This can repeat for at most $8 \log m = O(\log n)$
steps (m is the stream length), since after that the probability mass of the residual distribution is smaller than
a factor $1/m^2$ of the original distribution. The contribution of these elements towards the entropy H(p) of
the original distribution is at most $\frac{2\log m + \log n}{m^2}$. But if there is more than one element in the distribution
then the minimum entropy is $\frac{\log m}{m}$, since the worst case is when m − 1 elements are the same and 1 element
is different (this follows from concavity). So we can ignore the contribution of these elements (note that after
the first projection we are guaranteed that there are two distinct elements in the stream). □
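As a quick numerical sanity check of the minimum-entropy claim used in this proof, the sketch below computes the entropy of a stream in which m − 1 items are identical and one is different, and compares it with $(\log m)/m$. The helper names `entropy` and `near_constant_entropy` are hypothetical, and natural logarithms are assumed; a different base only rescales both sides by the same constant.

```python
import math

def entropy(counts):
    """Empirical entropy (natural log) of a list of item counts."""
    m = sum(counts)
    return sum((c / m) * math.log(m / c) for c in counts if c > 0)

def near_constant_entropy(m):
    """Entropy of a stream with m - 1 identical items and 1 different item."""
    return entropy([m - 1, 1])
```

The value is $(\log m)/m$ plus a term of size at most $1/m$, so it is of order $(\log m)/m$, which dominates the $O((\log m + \log n)/m^2)$ contribution that the proof discards.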

Proof of Lemma VH\ The RHS follows from the fact that for sufficiently large m, we have (1 — x/m) > 
e -( 1+ V^)^_ Therefore Pr[ Xij ] < 1 - (l - ^^^Y^ < 1 - e~( 1+£ ) 2 ^ < (1 + e) 2 ^f since (1 + 

e )2£|i < 1 xhe L HS follows analogously, Pr[ Xij } > 1 - (l - iQ^' > 1 - e - ^ > (1 - □ 






Proof of Lemma\T5\ Consider E 
which proves the lemma.



□ 



Proof of Lemma 24. Note that sample does not take a parameter, and therefore only the number of samples
we make can depend on the outcome of the probes we may do. However, we know that there can be at most t
samples taken. Hence if we request t samples initially we can assume that we do not need to do any further
sampling. (Note that we have at most doubled the number of oracle calls.) Let S be the set of i's seen as samples.
Obviously, taking more samples than were taken by A cannot be detrimental when it comes to correctness.
Then, for each value $i \in S$ we perform probe(p, i). This only adds t queries to the complexity.

We now have a randomized algorithm that takes as input the outcome of our t samples and the value
$p_i$ for all $i \in S$ and performs a series of probes probe(p, j) (wlog $j \notin S$). Note that since all samples have
already been made, this phase of the computation can be viewed as a randomized decision tree. At each
node, the algorithm either tosses a random coin or makes a query probe(p, j), and moves to one of the
children of this node based on all values seen thus far (we assume the algorithm makes no decision based
on the value of i, the so-called variable-oblivious model).

We now invoke a simplification of a lemma due to Bar-Yossef et al. [10, 6], as follows:



Lemma ([6], Lemma 4.17). Let T be a randomized decision tree that computes an epsilon-approximation
to f (where f is symmetric) with u queries in the worst case and $u_E$ queries in the expected case (with the
expectation taken over the random choices used by T). Then there is an oblivious decision tree $(T, W_u)$
(where $W_u$ is the uniform-without-replacement distribution) of worst-case query complexity u and expected
query complexity $u_E$ that computes an epsilon-approximation to f.
Our strategy is therefore as follows: rewrite the randomized decision tree that generates t probes as
an oblivious decision tree. In such a tree, all queries can be decided in advance and we now have an
algorithm of the desired canonical form. □



Proof of Theorem 25. Consider the following streaming algorithm that uses $O(t \log n \log m)$ space. We
store the first t items in the data stream, $\mathrm{Prefix}_t = \langle (p, i_1, +), (p, i_2, +), \ldots, (p, i_t, +) \rangle$. Now for each
$i \in S_w = \{i_1, i_2, \ldots, i_t\}$ we set up a counter that will be used to maintain an exact count of the frequency of
i. We now choose t values, $S'_w$, from $[n] \setminus S_w$ uniformly at random (without replacement) and set up a
counter for each of these t values. We also maintain a counter to track the length of the stream m. At the
end of the data stream we claim that we can simulate the oracle calls made by any canonical algorithm. The
only difficulty in establishing this claim is showing that we can use $\mathrm{Prefix}_t$ to simulate the generative
oracle. Ideally we would like to claim that we can just return $i_j$ on the jth query to the generative oracle.
This is however not the case. E.g., if the frequency of the element 10 is $f_{10}$ and the length of the stream is m,
then if $i_1 = 10$ the probability that $i_2 = 10$ is $\frac{f_{10} - 1}{m - 1} \ne p_{10}$ whereas, in the generative oracle, the
event that the second sample is 10 is independent of the first sample. We get around this dependence in the
random stream model by doing the following: on the jth generative oracle call we output $i_j$ with probability
$(m - j + 1)/m$ and otherwise output $i_{j'}$ where j' is chosen uniformly at random from $[j - 1]$. We thus can
emulate the generative oracle calls made by A. The evaluative oracle calls, i.e., the probes
performed by A, can also be emulated because for each $i_j$ we have maintained counters that give us $p_{i_j}$, and
for each $k \in S'_w$ we know $p_k$.
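The correction for the dependence between prefix elements can be sketched as below. This is an illustrative variant rather than a transcription of the rule in the proof: instead of indexing by the call number j, it tracks the number d of distinct stream positions consumed so far, takes a fresh prefix element with probability (m − d)/m, and otherwise repeats a uniformly chosen earlier position, which is exactly what sampling stream positions with replacement would do. The function name `emulate_generative_oracle` and the list-based prefix are assumptions; the prefix must hold at least t items.

```python
import random

def emulate_generative_oracle(prefix, m, t, rng=None):
    """Turn the first items of a random-order stream of length m into t
    samples distributed as draws with replacement from the stream."""
    rng = rng or random.Random()
    used = []   # prefix items already consumed, one per stream position
    out = []
    for _ in range(t):
        d = len(used)
        if rng.random() < (m - d) / m:
            # Fresh position: the next unread prefix item is a uniformly
            # random not-yet-used position of the random-order stream.
            used.append(prefix[d])
            out.append(prefix[d])
        else:
            # Collision with an earlier draw: repeat one of the d used
            # positions, chosen uniformly at random.
            out.append(rng.choice(used))
    return out
```

When m is much larger than t the correction almost never fires and the output is essentially the prefix itself, which matches the intuition that a short prefix of a long random-order stream is nearly an independent sample.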

For the two pass regular stream algorithm things are simpler. In the first pass we generate our random 
sample (with replacement) using standard techniques. In the second pass we count the exact frequencies of 
the relevant i. □ 
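The two-pass simulation can be sketched as follows, under simplifying assumptions: the stream is modeled as a re-iterable sequence of items rather than update triples, and the "standard technique" for the first pass is taken to be t independent size-one reservoirs, which yields t samples with replacement. The name `two_pass_probes` and the `stream_factory` callback are hypothetical, and the stream is assumed to be non-empty.

```python
import random
from collections import Counter

def two_pass_probes(stream_factory, t, rng=None):
    """Return t (sample, exact probability) pairs using two passes.

    stream_factory() must return a fresh iterator over the same stream.
    """
    rng = rng or random.Random()
    samples = [None] * t
    m = 0
    # Pass 1: t independent size-one reservoirs give samples with replacement.
    for item in stream_factory():
        m += 1
        for k in range(t):
            if rng.random() < 1.0 / m:   # keep the m-th item with prob. 1/m
                samples[k] = item
    # Pass 2: exact frequency counts for the sampled items only.
    wanted = set(samples)
    counts = Counter()
    for item in stream_factory():
        if item in wanted:
            counts[item] += 1
    return [(s, counts[s] / m) for s in samples]
```

Each returned pair plays the role of one sample call followed by one probe call of the combined oracle; counters for additional non-sampled indices could be added in the second pass in the same way.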

B The Large-Small algorithm for Entropy 

The algorithm we will use can be described as a block-by-block process or as a much more elegant simplification
reminiscent of the online facility location algorithm of Meyerson [32] in the context of stream clustering.
There are also some similarities with the count/count-min sketches of [14, 16]. Our algorithm will "track"
a few items, i.e., maintain explicit counters for them. Let $1 > \alpha > 0$ be a constant to be decided later. Let
H be the true entropy and $p_i$ the true probability of i. Assume $m > n > 3$. The algorithm is presented in
Figure 5.

Lemma 26. Let the true entropy be H; then $\hat{H} \ge H$.

Proof. The fact that we never underestimate the entropy follows from the fact that we never overestimate the
probabilities. The fact that we distribute the "residual" probability among the untracked elements in a uniform
way ensures that the lemma holds. We can also derive an algebraic proof, but the information-theoretic
proof is simpler. □

In the rest of the subsection we will show that we do not overestimate the entropy by much, i.e., $\hat{H} \lesssim H$. The
proof that we correctly track the "large" probability elements is straightforward.

Claim 27. If $2\hat{m} \ge m \ge \hat{m}$ (for the appropriate guess $\hat{m}$) and $p_i \ge n^{-\alpha}/2$ then i is tracked with probability
$1 - n^{-2}$; furthermore, $|c(i)/m - p_i| \le \epsilon n^{-\alpha}$.

Notice that the size of $\bar{S}$ is at most $n^{\alpha}$, hence we have,

Corollary 28. Let $S = \{i \mid p_i < n^{-\alpha}\}$ and $w = \sum_{i \in S} p_i$. Then whp, $\hat{S} = S$ and $\hat{w} \ge w \ge \hat{w} - \epsilon$.

The proof of the next claim is not too difficult, but there is a small issue: $x \log\frac{1}{x}$ is a bitonic (increasing
and then decreasing) function with its maximum at $x = 1/e$.
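The bitonic shape is easy to check numerically. The sketch below (the names `h` and `PEAK` are hypothetical, and natural logarithms are assumed) evaluates $x \log\frac{1}{x}$ on both sides of $1/e$. Setting the derivative $-\log x - 1$ to zero gives $x = 1/e$, and changing the base of the logarithm rescales the function without moving the maximum.

```python
import math

def h(x):
    """Entropy contribution x * log(1/x), with h(0) = 0 by convention."""
    return x * math.log(1.0 / x) if x > 0 else 0.0

# Location of the maximum; the maximum value h(PEAK) is also 1/e.
PEAK = 1.0 / math.e
```

The practical consequence for the estimator is that underestimating a probability below 1/e lowers its entropy term, whereas underestimating a probability above 1/e raises it (for instance h(0.45) exceeds h(0.5)), which is why the large-probability case needs a separate argument.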
Algorithm Entropy-Estimation

1. Guess $\hat{m} = 1, 2, 4, \ldots, 2^k, \ldots$ as a rough estimate of the number of insertions m.

2. For each guess of $\hat{m}$:

• Initially m = 0.

• On new input (i, +), increase m. If $m > 2\hat{m}$, abort the process.

•If i is being "tracked", increase counter of i. Else with prob. J"!^ decide to start 
tracking i. 

• Keep checking if at least 2 elements are seen, i.e., the entropy is non-zero.

3. At the end of the input focus on the guess where $2\hat{m} \ge m \ge \hat{m}$, since we know m.

4. Define the collection of tracked objects $\langle i, c(i) \rangle$ as the "signature" of the stream.

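5. If c(i) is the count of i, let $\bar{S}$ be the set of tracked elements with $c(i)/m \ge n^{-\alpha}$. (The
choice of the notation implies S is the set of "small" probability elements.) Let S be
the complement of $\bar{S}$.

6. For $i \in \bar{S}$ define $h_i = \frac{c(i)}{m} \log \frac{m}{c(i)}$.

7. Let $\hat{w} = 1 - \sum_{i \in \bar{S}} c(i)/m$. Define $\hat{H}_S = \hat{w} \log \frac{n}{\hat{w}}$. We will assume n > 3.

8. Output $\hat{H} = \hat{H}_S + \sum_{i \in \bar{S}} h_i$.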



Figure 5: The Large-Small Algorithm for Estimating Entropy in a Data Stream 
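A single-guess version of the procedure in Figure 5 can be sketched as below. Several choices here are assumptions made only for illustration and are not taken from the paper: the function name `large_small_entropy`, the tracking probability `track_prob` supplied as a parameter, the largeness test $c(i)/m \ge n^{-\alpha}$, the small-element term $\hat{w} \log(n/\hat{w})$ obtained by spreading the residual mass uniformly over at most n elements, and natural logarithms. The doubling over guesses of the stream length and the abort rule are omitted.

```python
import math
import random

def large_small_entropy(stream, n, alpha, track_prob, rng=None):
    """Estimate entropy by tracking some items exactly and bounding the rest."""
    rng = rng or random.Random()
    counts = {}   # the "signature": counters of tracked items
    m = 0
    for item in stream:
        m += 1
        if item in counts:
            counts[item] += 1                 # already tracked: count it
        elif rng.random() < track_prob:
            counts[item] = 1                  # start tracking this item
    threshold = n ** (-alpha)
    # "Large" elements: tracked items whose estimated probability is big.
    large = {i: c for i, c in counts.items() if c / m >= threshold}
    h_large = sum((c / m) * math.log(m / c) for c in large.values())
    # Residual mass is spread uniformly over the remaining elements.
    w_hat = 1.0 - sum(large.values()) / m
    h_small = w_hat * math.log(n / w_hat) if w_hat > 0 else 0.0
    return h_large + h_small
```

Because counters start late, tracked probabilities are never overestimated, so the residual mass is never underestimated; with `track_prob = 1` and every element large, the estimate coincides with the exact empirical entropy.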



Lemma 29. Whp, $\sum_{i \in \bar{S}} h(i) \le (1+\epsilon) \sum_{i \in \bar{S}} p_i \log\frac{1}{p_i} + 2\epsilon n^{-\alpha}$.

Proof. For all probabilities less than 1/e we can suffer a loss of at most a $(1+\epsilon)$ factor, since we underestimate
the probability. But for $p_i > 1/e$ the derivative is bounded, with magnitude at most 1. This applies to at most two
elements. Observe that the worst case arises when $\sum_{i \in \bar{S}} p_i \log\frac{1}{p_i}$ (which is a concave function) is as low
as possible. This means that the worst case is when exactly one $p_i \approx 1$. But the derivative of $p_i \log\frac{1}{p_i}$ is
bounded, and $|p_i \log\frac{1}{p_i} - h(i)|$ is at most $2(p_i - c(i)/m)$, which is bounded by Claim 27. □
Now let us focus on the small probability elements.
Now let us focus on the small probability elements. 

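Lemma 30. Let $H_S = \sum_{i \in S} p_i \log\frac{1}{p_i}$ (the true contribution of small elements); then $\hat{H}_S - \epsilon \log\frac{n}{\epsilon} \le \frac{H_S}{\alpha}\left(1 + \frac{\log\frac{1}{\epsilon}}{\log n}\right)$.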

Proof. For each $i \in S$, $p_i < n^{-\alpha}$. $H_S$ is minimized if, given $p_i, p_j$, we set $p'_i = p_i + p_j$ and $p'_j = 0$ (by
concavity). Now to respect $p_i \le n^{-\alpha}$, the vector $\{p_i\}$ must have $p_i = n^{-\alpha}$ or $p_i = 0$ for all $p_i$ except one.
This last i takes care of the excess, i.e., $p_1 = w - \lfloor w n^{\alpha} \rfloor n^{-\alpha}$. Therefore



We immediately have two subcases 



n a n a w — -^77 n 
Case w < n a : In this case H$ > w log and

ra n n n log n + log - log n + log - 

Ha — e log — < (w + e) log e log — < w log — < Hs ; < Hs ; - 

° e w + e e e — log w a log n 

Case w > rT a \ In this case Hs > 2 ™- a log n a and therefore, 

n n log n + log - log n + log - 

Hg — e log — < w log — < 2Hsn : < Hs : (for large enough n) 

s e e alogn alogra 

Thus the first case dominates and the lemma is true. □ 

Putting together Lemmas 29 and 30 we get:



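Theorem 31. We can estimate the entropy up to a factor $\frac{1}{\alpha}\left(1 + \frac{\log\frac{1}{\epsilon}}{\log n}\right)$ and an additive term $\epsilon\left(\log\frac{n}{\epsilon} + n^{-\alpha}\right)$
in space $O(\epsilon^{-1} n^{\alpha} \log n)$.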

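We will assume that H is bounded away from 0. Let $\epsilon' = \alpha\,\frac{\epsilon\left(\log\frac{n}{\epsilon} + n^{-\alpha}\right)}{H} + \frac{\log\frac{1}{\epsilon}}{\log n}$ and let $f(\epsilon', H)$ be
equal to the $\epsilon$ satisfying this expression.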



Theorem 18. We can estimate the entropy H up to a factor $\frac{1}{\alpha}(1+\epsilon')$ using space $O\!\left(\frac{n^{\alpha} \log n}{f(\epsilon', H)}\right)$.